A Computer Vision-Based Yoga Pose Grading Approach Using Contrastive Skeleton Feature Representations

The main objective of yoga pose grading is to assess the input yoga pose and compare it to a standard pose in order to provide a quantitative evaluation as a grade. In this paper, a computer vision-based yoga pose grading approach is proposed using contrastive skeleton feature representations. First, the proposed approach extracts human body skeleton keypoints from the input yoga pose image and then feeds their coordinates into a pose feature encoder, which is trained using contrastive triplet examples; finally, a comparison of similar encoded pose features is made. Furthermore, to tackle the inherent challenge of composing contrastive examples in pose feature encoding, this paper proposes a new strategy to use both a coarse triplet example—comprised of an anchor, a positive example from the same category, and a negative example from a different category, and a fine triplet example—comprised of an anchor, a positive example, and a negative example from the same category with different pose qualities. Extensive experiments are conducted using two benchmark datasets to demonstrate the superior performance of the proposed approach.


Introduction
Yoga pose grading aims to quantitatively evaluate yoga poses so that it can realize yoga pose recognition (how a yoga pose is performed) and evaluate pose quality (how well a yoga pose is performed) [1,2], distinguishing different movements by analyzing pose characteristics. The most important aspect of yoga exercise is to do it correctly, since any wrong position can be counterproductive and possibly lead to injury [3][4][5]. However, not all users have access to a professional instructor. Many yoga beginners can only learn yoga through self-study, such as by mechanically copying a recorded yoga video or remotely watching a live yoga session. Consequently, they have no way of knowing whether their pose is good or poor without the help of an instructor. Therefore, automatically evaluating yoga poses is critical both for recognizing yoga poses and for providing suggestions to alert learners [6].
There are various types of artificial intelligence-based solutions for yoga pose analysis that have been developed in the literature, including (i) the wearable device-based approach [7,8], (ii) the Kinect-based approach [9][10][11], and (iii) the computer vision-based approach.
First, wearable device-based approaches usually require attaching sensors to each joint of the human body during yoga exercise. Wu et al. proposed a pose recognition and quantitative evaluation approach [7]. A wearable device with eleven inertial measurement units (IMUs) is fixed onto the human body in order to measure yoga pose data. Then, the artificial neural network and fuzzy C-means are combined to classify the input pose into a category. In addition, the angular differences between nonstandard parts (e.g., the yoga student) and the standard pose model (e.g., the yoga teacher) are calculated to guide yoga learners. Puranik et al. proposed a wearable system [8] where a wrist subsystem is used to monitor a pose with the help of a flex sensor, and a waist subsystem is built to monitor the pose with the use of a flex sensor. However, such solutions are impractical for long-term applications due to their maintenance concerns.
Second, Kinect-based approaches deploy the Kinect device to extract features. Chen et al. captured the yoga learner's body map and extracted the body contour [9]. Then, a fast skeletonization technique was used to obtain a human pose feature for yoga pose recognition. Trejo and Yuan presented a yoga pose classification approach that employs the KinectV2 camera and the Adaboost classifier algorithm to recognize six poses [10]. Islam et al. presented a yoga pose recognition method that leverages fifteen keypoints detected from Kinect camera images and uses pose-based matching for pose recognition [11]. However, the depth sensor-based camera required in these solutions may not always be available to users.
Third, computer vision-based approaches use non-invasive computer vision techniques to extract pose characteristics and perform pose analysis, as reviewed in Section 2. They are more suitable for amateur training and home exercise. Since the advent of human pose analysis techniques, many studies have begun to examine how to apply them in the field of intelligent sports learning [12].
Computer vision-based yoga pose grading is a difficult task due to the following challenges. The first challenge is the lack of a yoga pose grading benchmark, as image-level annotation is expensive; hence, supervised representation learning might not be feasible. The second challenge lies in the fundamental difference between the learner's pose image and the standard pose image. Aggregated features that combine multiple deep features from pre-trained models might be more robust than a single type of feature [13]. In addition, human body skeleton information might be robust enough to handle this diversity. To tackle these challenges, the contrastive learning technique [14][15][16] is a potential solution.
Its key idea is to conduct a discriminative learning approach to learn encoded feature representations, in which similar sample pairs remain close together, whereas different sample pairs remain widely apart. It has been successfully verified in many computer vision tasks such as image classification [17] and human activity recognition [18,19].
Motivated by this, a computer vision-based yoga pose grading approach using contrastive skeleton feature representations is proposed in this paper. The following are the main contributions of this paper:

• To tackle the challenge of variation between the learner's pose image and the standard pose image, contrastive learning is introduced in this paper to develop a yoga pose grading approach that uses contrastive skeleton feature representations instead of diverse and complicated backgrounds in the images. The proposed approach is able to learn discriminative features from human skeleton keypoints for yoga pose grading, as verified in our experimental results.
• To tackle the challenge of the establishment of contrastive examples used for discriminative feature learning, a novel strategy is proposed in this paper to compose the contrastive examples using both the coarse triplet example, which consists of an anchor, a positive example from the same category, and a negative example from a different category, and the fine triplet example, which consists of an anchor, a positive example, and a negative example from the same category with different pose qualities.
The rest of this paper is organized as follows. Section 2 provides a brief review of the existing research works in yoga pose classification and yoga pose grading. Then, the proposed yoga pose grading approach using contrastive skeleton feature representations is presented in Section 3, and then evaluated in extensive experiments in Section 4. Limitations and future studies are also provided in Section 4. Finally, this paper is concluded in Section 5.

Yoga Pose Classification
Recently, deep learning has achieved an impressive performance in addressing the yoga pose classification task due to its powerful feature learning capability. Yadav et al. proposed a hybrid deep learning framework where the convolutional neural network (CNN) layer is used in each frame to extract features from human body keypoints returned by OpenPose [33], followed by the long short-term memory (LSTM) layers performing temporal learning [20]. Maddala et al. proposed to integrate joint angular movements along with the joint distances in a spatiotemporal color-coded image, which is further analyzed using a CNN model [21]. To address the privacy issue in the camera-based solution, Gochoo et al. proposed a privacy-preserving yoga pose recognition approach that utilizes a deep CNN and a low-resolution infrared sensor [22]. The OpenPose-based skeleton keypoint extraction and the CNN model were also studied in [23]. Special attention was paid to applying a rule-based classification in order to detect fall risk during yoga exercise in [24]. A benchmark dataset for fine-grained yoga pose classification and several CNN baselines are provided in [25]. Other examples of deep learning-based yoga pose classification include the image-based CNN model and transfer learning [26,27], and the three-dimensional CNN model for yoga videos [28].

Yoga Pose Grading
In contrast to the objective of yoga pose classification to infer the yoga pose class label, yoga pose grading aims to automatically quantify how well people perform yoga actions. Despite the fairly popular studies on yoga pose classification, there are not many works on yoga pose grading. Patil et al. proposed to identify yoga pose variations between different persons by comparing the similarity between the speeded up robust feature (SURF) extracted from the input pose images [29]. Chen et al. proposed to capture the user body map, and then apply the skeleton to extract the human body feature points to identify the correct pose [30]. Chaudhari

Motivation and Research Challenge
Despite the fairly popular studies in yoga pose classification, there is a lack of yoga pose grading research, except for the works in [29][30][31][32]. The limitations of existing works lie in two aspects:
• First, it is a challenge to rely on the whole pose image for pose grading due to the fundamental difference between the learner's pose image and the standard pose image. To address this, the proposed approach exploits the skeleton keypoints from the pose image, or more specifically, the discriminative features that are learned from the contrastive skeleton feature representations. This is in contrast to [29], where the whole pose image is used.
• Second, domain knowledge is required to define customized rules for grading specific yoga poses, which makes it difficult for such methods to handle new types of yoga poses. For example, the methods in [30][31][32] require domain knowledge to define the rules for evaluating yoga poses by checking characteristics (e.g., positions or angles) of the skeleton keypoints of various yoga postures. To address this, the proposed approach relies on machine learning methods in order to provide general yoga grading without the need for additional domain knowledge.
In summary, to tackle these challenges, a pose grading approach using contrastive skeleton feature representations is proposed in this paper.

Proposed Approach
The objective of the proposed yoga pose grading approach is to input two yoga pose images from the learner and the coach, respectively, and then extract the human skeleton keypoints and feed them into the pose feature encoder. Finally, the feature similarity between them is calculated in order to obtain a pose grade. As illustrated in Figure 1, the proposed framework consists of a model training process and a model inference process. More specifically, the model training process consists of three key components: (i) construction of contrastive examples, (ii) skeleton extraction, (iii) pose feature encoding using contrastive skeleton feature representations. The model inference process consists of (i) skeleton extraction, (ii) pose feature encoder, and (iii) feature similarity comparison. All of these components are described in the following sections in detail.

Construction of Contrastive Examples
The proposed framework exploits the contrastive learning concept, which applies a weight-sharing neural network to multiple inputs. This is a natural tool for comparing various pose images. To learn effective discriminative representations, the composition of the contrastive data is crucial in defining the contrastive prediction tasks. For that, we exploit the triplet example [34] in this work. The idea is to learn discriminative feature embedding representations where similar features are projected onto a nearby region, whereas dissimilar features are projected far away from each other. To be more specific, we propose to use both the coarse triplet example, comprised of an anchor, a positive example from the same category, and a negative example from a different category, and the fine triplet example, comprised of an anchor, a positive example, and a negative example from the same category with different pose qualities. To illustrate the difference between these two types of triplet examples, a few examples are presented in Figure 2.
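The two kinds of triplet examples can be illustrated with a small sampling sketch. The dataset layout (a dictionary from category name to a list of image identifier and quality label pairs), the function names, and the toy image identifiers are all hypothetical and chosen only for illustration; the paper does not describe its sampling code.

```python
import random

# Hypothetical layout: category -> list of (image_id, quality) records.
# Quality labels are only needed when building fine triplets.

def make_coarse_triplet(dataset, rng):
    """Anchor and positive share a category; the negative comes from another one."""
    pos_cat, neg_cat = rng.sample(sorted(dataset), 2)
    anchor, positive = rng.sample(dataset[pos_cat], 2)
    negative = rng.choice(dataset[neg_cat])
    return anchor[0], positive[0], negative[0]

def make_fine_triplet(dataset, category, rng):
    """All three images share a category but differ in graded pose quality."""
    by_quality = {}
    for image_id, quality in dataset[category]:
        by_quality.setdefault(quality, []).append(image_id)
    # Returned in (high, medium, low) order.
    return tuple(rng.choice(by_quality[q]) for q in ("high", "medium", "low"))

# Toy data, purely illustrative.
dataset = {
    "tree": [("tree_01", "high"), ("tree_02", "medium"), ("tree_03", "low")],
    "cobra": [("cobra_01", "high"), ("cobra_02", "medium"), ("cobra_03", "low")],
}
rng = random.Random(0)
coarse = make_coarse_triplet(dataset, rng)
fine = make_fine_triplet(dataset, "tree", rng)
```

In this sketch the coarse triplet only needs category labels, whereas the fine triplet additionally needs a quality grade per image, which mirrors the different annotation requirements of the two example types.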

Skeleton Extraction
Due to the fact that some yoga poses are too complicated to be captured from a single point of view, the utilization of skeleton keypoints of the human targets in the pose images may be more suited for analyzing various poses than the whole pose image. In view of this, the proposed framework exploits the human skeleton keypoints in yoga pose grading instead of analyzing the whole pose image that is usually difficult due to diverse backgrounds and human appearance.
In this paper, we adopt Mediapipe [35], which utilizes the state-of-the-art machine learning model BlazePose [36] for skeleton keypoint extraction. It detects human body parts and tracks keypoints on these body parts. Each of these keypoints is represented by a two-dimensional coordinate with values in the range of (0, 1), corresponding to the position of the pixel in the image normalized with respect to the image width and height. The implementation details are provided as follows. The static_image_mode is set to True as we process a single pose image as the input, the min_detection_confidence is set to the default value of 0.5, and the model_complexity is set to 2 to obtain the most accurate keypoint results. After Mediapipe is applied to the input pose image, 33 keypoints of the human body are detected in one pose image. Each keypoint of the human body has two coordinate values; therefore, each image yields a (2, 33) array of coordinate values that is used as the input to the following pose feature encoder.
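A minimal sketch of this extraction step is given below, assuming the legacy `mediapipe.solutions.pose` Python interface (whose keyword for the detection threshold is `min_detection_confidence`) and OpenCV for image loading. The helper names `landmarks_to_array` and `extract_skeleton` are hypothetical, and the stand-in landmarks at the end only demonstrate the output layout without requiring Mediapipe to be installed.

```python
from types import SimpleNamespace

import numpy as np

def landmarks_to_array(landmarks):
    """Stack normalized (x, y) coordinates of N keypoints into a (2, N) array."""
    return np.array([[lm.x for lm in landmarks],
                     [lm.y for lm in landmarks]], dtype=np.float32)

def extract_skeleton(image_path):
    """Run Mediapipe Pose on one image; return a (2, 33) array or None if no person is found."""
    # Imported lazily so that landmarks_to_array can be used without these packages.
    import cv2
    import mediapipe as mp

    image_bgr = cv2.imread(image_path)
    if image_bgr is None:
        raise FileNotFoundError(image_path)
    with mp.solutions.pose.Pose(static_image_mode=True,
                                model_complexity=2,
                                min_detection_confidence=0.5) as pose:
        # Mediapipe expects RGB input, while OpenCV loads images as BGR.
        results = pose.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return None
    return landmarks_to_array(results.pose_landmarks.landmark)

# Stand-in landmarks (not real detections) to show the resulting array layout.
fake_landmarks = [SimpleNamespace(x=i / 33, y=0.5) for i in range(33)]
skeleton = landmarks_to_array(fake_landmarks)
```

Returning None when no person is detected is a design choice of this sketch; how the original pipeline handles such images is not stated.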

Pose Feature Encoding Using Contrastive Skeleton Feature Representations
The proposed approach aims to learn the discriminative representations by maximizing the agreement between similar yoga pose images via a contrastive loss in the latent feature space. It consists of the following key components:
• A neural network encoder (denoted as $f(\cdot)$) that extracts representation vectors from input contrastive data examples. It maps representations to the space where the contrastive loss is applied. The detailed network architecture is illustrated in Figure 3. The proposed encoder takes the introduced skeleton points as the input, and then it adopts a sequence of Conv1D layers, where the numbers of filters are 16, 32, 32, and 32; each filter has the same kernel size of 15. Batch normalization and average pooling are applied after each Conv1D layer. Finally, the encoded feature is obtained with a dimension of 32.
• When the coarse triplet example is used, the encoder takes a triplet example $x_a$, $x_p$, and $x_n$ as the input. These three images are processed to extract their respective skeleton points $s_a$, $s_p$, and $s_n$, each of which has a size of (2, 33). Then, they are further processed by a weight-shared encoder network $f(\cdot)$ to obtain their respective features $z_a$, $z_p$, and $z_n$. A triplet contrastive loss is defined as follows [34]:
$$\mathcal{L}_{coarse} = \max\left(\lVert z_a - z_p \rVert_2^2 - \lVert z_a - z_n \rVert_2^2 + \alpha_c,\ 0\right), \tag{1}$$
where $\alpha_c$ is a margin between positive and negative examples.
• On the other hand, when the fine triplet example is used, the encoder takes a triplet example $x_h$, $x_m$, and $x_l$ as the input, all of which are from the same category but are of high quality, medium quality, and low quality, respectively. These three images are processed to extract their respective skeleton points $s_h$, $s_m$, and $s_l$, each of which has a size of (2, 33). Then, they are further processed by a weight-shared encoder network $f(\cdot)$ to obtain their respective features $z_h$, $z_m$, and $z_l$. A triplet contrastive loss is defined as follows:
$$\mathcal{L}_{fine} = \max\left(\lVert z_h - z_m \rVert_2^2 - \lVert z_h - z_l \rVert_2^2 + \alpha_h,\ 0\right) + \max\left(\lVert z_l - z_m \rVert_2^2 - \lVert z_l - z_h \rVert_2^2 + \alpha_l,\ 0\right), \tag{2}$$
where $\alpha_h$ and $\alpha_l$ are the margins when the high-quality example and the low-quality example are used as anchors, respectively.
Finally, the two losses are combined into the overall training loss as follows:
$$\mathcal{L} = \mathrm{AVG}_{coarse}\left(\mathcal{L}_{coarse}\right) + 5 \times \mathrm{AVG}_{fine}\left(\mathcal{L}_{fine}\right), \tag{3}$$
where $\mathrm{AVG}_{coarse}(\cdot)$ and $\mathrm{AVG}_{fine}(\cdot)$ represent the average loss calculated using the coarse triplet examples and the fine triplet examples in the batch, respectively. In addition, the loss that is obtained from the fine triplet examples is multiplied by a factor of 5 in this combination (3), as the fine triplet examples are treated as more important in the model training.
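The losses referred to as (1), (2), and (3) can be sketched numerically as follows. This is a sketch under stated assumptions rather than the authors' implementation: it assumes squared Euclidean distances between encoded features, as in the standard triplet loss of [34], and it uses the margins reported later in the implementation details (0.1 for the coarse triplet, 0.2 for both fine margins). All function names are hypothetical.

```python
import numpy as np

def sq_dist(u, v):
    """Squared Euclidean distance between feature batches of shape (B, D); returns (B,)."""
    return np.sum((u - v) ** 2, axis=-1)

def coarse_triplet_loss(z_a, z_p, z_n, alpha_c=0.1):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin, per example."""
    return np.maximum(sq_dist(z_a, z_p) - sq_dist(z_a, z_n) + alpha_c, 0.0)

def fine_triplet_loss(z_h, z_m, z_l, alpha_h=0.2, alpha_l=0.2):
    """Two hinges: high-quality example as anchor, then low-quality example as anchor."""
    high_anchor = np.maximum(sq_dist(z_h, z_m) - sq_dist(z_h, z_l) + alpha_h, 0.0)
    low_anchor = np.maximum(sq_dist(z_l, z_m) - sq_dist(z_l, z_h) + alpha_l, 0.0)
    return high_anchor + low_anchor

def total_loss(coarse_losses, fine_losses, fine_weight=5.0):
    """Batch average of each loss, with the fine term weighted by a factor of 5."""
    return np.mean(coarse_losses) + fine_weight * np.mean(fine_losses)

# Toy one-dimensional features: medium sits between high and low (well ordered).
z_h, z_m, z_l = np.array([[0.0]]), np.array([[1.0]]), np.array([[2.0]])
ordered = fine_triplet_loss(z_h, z_m, z_l)
# Swapping medium and low violates the ordering and is penalized.
violated = fine_triplet_loss(z_h, z_l, z_m)
combined = total_loss(coarse_triplet_loss(z_h, z_m, z_l), violated)
```

In the well-ordered toy case both hinges are inactive, so the loss is zero; once the medium-quality feature is placed farther from the high-quality one than the low-quality feature is, both hinges contribute.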

Inference
The model inference process consists of (i) skeleton extraction, (ii) pose feature encoder, and (iii) feature similarity comparison. The skeleton extraction and the pose feature encoder are the same as those used in the model training process. Given two input yoga pose images from the student and the teacher (denoted as $x_s$ and $x_t$, respectively), the human skeleton keypoints are extracted and fed into the pose feature encoder. Finally, the feature similarity between their encoded features $z_s$ and $z_t$ is calculated to obtain a pose grade as follows:
$$\mathrm{sim}(z_s, z_t) = \frac{z_s \cdot z_t}{\lVert z_s \rVert_2 \, \lVert z_t \rVert_2}, \tag{4}$$
which calculates the dot product between the $L_2$-normalized $z_s$ and $z_t$ (i.e., the cosine similarity).
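The similarity computation at the end of the inference process can be sketched as below. The function name `pose_grade` and the small `eps` guard against zero-length vectors are assumptions added for illustration; the encoded features are taken to be plain one-dimensional vectors.

```python
import numpy as np

def pose_grade(z_s, z_t, eps=1e-12):
    """Cosine similarity: dot product of the L2-normalized student and teacher features."""
    z_s = np.asarray(z_s, dtype=np.float64)
    z_t = np.asarray(z_t, dtype=np.float64)
    z_s = z_s / max(np.linalg.norm(z_s), eps)
    z_t = z_t / max(np.linalg.norm(z_t), eps)
    return float(np.dot(z_s, z_t))

same_direction = pose_grade([3.0, 4.0], [6.0, 8.0])   # parallel features
orthogonal = pose_grade([1.0, 0.0], [0.0, 1.0])       # unrelated features
```

Because both vectors are normalized first, the grade depends only on the direction of the encoded features and not on their magnitude, and it always lies between -1 and 1.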

Dataset
Two benchmark datasets are used in our experiments.
• Dataset A: This is the yoga pose classification image dataset adopted from Kaggle [37], where 45 categories and 1931 images are selected. In this dataset, images are captured with various resolutions and diverse backgrounds. An overview of these categories is illustrated in Figure 4.
• Dataset B: This is the yoga pose grading image dataset that we constructed. In this dataset, 3000 triplet examples are collected, where each triplet example consists of three pose images that belong to the same yoga pose category. These images have various resolutions and diverse backgrounds. Then, professional yoga teachers [38] are engaged to grade these three images with respect to the standard pose image in order to obtain three grades: high-quality, medium-quality, and low-quality. An example of this dataset is illustrated in Figure 5.
These two serve as the benchmark datasets for evaluating and justifying the proposed approach in experiments.

Performance Metrics
The performance of the proposed approach is evaluated using the two types of performance metrics below.
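The first method is the pose recognition performance evaluation using Dataset A. Two images (simulating one image from the student and the other image from the teacher) are randomly selected from this dataset. Then, the proposed approach compares their feature similarity against a user-defined threshold (set to 0.75 in our experiments) in order to make a binary decision of whether they belong to the same category. Subsequently, the following four criteria are defined:
• True positive (TP): the two images belong to the same category and are predicted as the same category.
• True negative (TN): the two images belong to different categories and are predicted as different categories.
• False positive (FP): the two images belong to different categories but are predicted as the same category.
• False negative (FN): the two images belong to the same category but are predicted as different categories.
Based on the four aforementioned criteria, we further define the following performance metrics: Accuracy, Precision, Recall, and F1.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
In this experiment, 1656 pairs of photos are randomly selected from Dataset A, including 828 positive pairs and 828 negative pairs.
The second method is the pose feature similarity performance evaluation using Dataset B. The criterion is that the distance between high-quality and low-quality pairs should be larger than that between high-quality and medium-quality pairs, and larger than that between low-quality and medium-quality pairs. The Accuracy of the proposed approach is defined as the ratio between the number of tests where the proposed approach makes the correct decision and the total number of tests. In this experiment, 254 examples from Dataset B are used.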
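Both evaluation protocols can be sketched in a few lines. The sketch assumes that a pair is predicted as belonging to the same category when its similarity is at or above the threshold, and that the Dataset B criterion is checked with Euclidean distances between encoded features; both choices, as well as the function names and toy values, are assumptions made for illustration.

```python
import numpy as np

def binary_metrics(similarities, same_category, threshold=0.75):
    """Accuracy, Precision, Recall, and F1 for thresholded pair similarities."""
    sims = np.asarray(similarities, dtype=float)
    truth = np.asarray(same_category, dtype=bool)
    pred = sims >= threshold  # assumption: high similarity means "same category"
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(truth), "precision": precision,
            "recall": recall, "f1": f1}

def fine_ordering_correct(z_h, z_m, z_l):
    """True if the high/low distance exceeds both distances involving the medium example."""
    def dist(u, v):
        return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))
    d_hl = dist(z_h, z_l)
    return d_hl > dist(z_h, z_m) and d_hl > dist(z_l, z_m)

# Toy values: two positive pairs and two negative pairs.
metrics = binary_metrics([0.9, 0.8, 0.6, 0.7], [True, False, True, False])
ordered_ok = fine_ordering_correct([0.0], [1.0], [2.0])
```

The Dataset B accuracy is then the fraction of graded triplets for which `fine_ordering_correct` returns True.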

Baseline Approaches
The relevant yoga pose grading works [29][30][31][32] were reviewed in Section 2.2. These approaches are not suitable as baselines in our experiments because they do not allow a fair comparison. First, the method in [29] needs to compare the whole pose image, which is different from the proposed approach that uses only skeleton keypoints. Second, the methods in [30][31][32] require domain knowledge to define the rules for checking the angles of the skeleton keypoints of various yoga poses, which is not available for our pose dataset.
In order to conduct a fair experiment to justify the performance of the proposed approach, we define the following two baseline approaches in the performance comparison.
• Baseline Approach 1: This extracts the skeleton keypoints from the input pose image and then builds a virtual skeleton image as follows. The size of the skeleton image is first set to (224, 224), then the background color is set to black, each keypoint is then assigned a unique color, and the connections between them are drawn according to the definition of the keypoints. In addition, the image augmentation method is used in the model training, including a random rotation of up to 30 degrees, random scaling, and cropping with a factor in the interval between 0.8 and 1.0. The MobileNetV3 network [39] is used as the backbone, the cross-entropy loss is used, and the output feature vector length is 128. In the model training, 1931 images from 45 categories are used. Finally, the encoded features are used to compare feature similarity in the inference process.
• Baseline Approach 2: This exploits the same model architecture as the proposed approach. However, the cross-entropy loss is used to build a pose classification model. In the model training, 1931 images from 45 categories are used. After the model is trained using Dataset A, the encoded pose feature is used to compare feature similarity in the inference process.
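The virtual skeleton image used by Baseline Approach 1 can be approximated with a NumPy-only rasterizer. This is a rough sketch: the edge list is only an illustrative subset of the BlazePose connections (shoulders, arms, torso, and legs), the color palette is an arbitrary scheme that gives each keypoint a distinct color, and the white connection color is an assumption. None of these details are specified in the text.

```python
import numpy as np

# Illustrative subset of BlazePose connections; the full definition has more edges.
EDGES = [(11, 12), (11, 13), (13, 15), (12, 14), (14, 16),
         (11, 23), (12, 24), (23, 24), (23, 25), (25, 27), (24, 26), (26, 28)]

def render_skeleton(skeleton, size=224):
    """Draw a (2, 33) array of normalized keypoints onto a black (size, size, 3) image."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    # Convert normalized coordinates to pixel indices, clipped to the canvas.
    px = np.clip(np.rint(skeleton * (size - 1)), 0, size - 1).astype(int)
    # Draw each connection in white by sampling points along the segment.
    for i, j in EDGES:
        xs = np.rint(np.linspace(px[0, i], px[0, j], size)).astype(int)
        ys = np.rint(np.linspace(px[1, i], px[1, j], size)).astype(int)
        canvas[ys, xs] = 255
    # Draw keypoints last so that their colors are not overwritten by the lines.
    for k in range(skeleton.shape[1]):
        canvas[px[1, k], px[0, k]] = (7 * k + 20, 255 - 7 * k, (37 * k) % 256)
    return canvas

# Toy skeleton: 33 keypoints spread along a horizontal line.
demo = np.stack([np.linspace(0.1, 0.9, 33), np.full(33, 0.5)])
image = render_skeleton(demo)
```

A practical version would likely draw thicker points and lines with an image library, but the single-pixel version is enough to show how keypoint coordinates become an image input for a CNN backbone.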

Implementation Details of the Proposed Approach
The implementation details of the proposed approach are provided as follows. The triplet examples are constructed as described in Section 3. Mediapipe [35] is applied to each input yoga pose image to extract its 33 skeleton keypoints. Then, the coordinates of these keypoints from the triplet example are used as the input to the proposed approach. In the model training process, 1931 coarse triplet examples and 591 fine triplet examples are used. The initial learning rate is set to 0.005, with a weight decay of 0.1 to prevent model over-fitting. As augmentation, the coordinates are randomly shifted by adding a value drawn from a Gaussian distribution with a zero mean and a variance of 0.02. The model is optimized using the Adam stochastic optimization algorithm [40]. In the proposed triplet loss, the margin $\alpha_c$ in (1) is set to 0.1, and both margins $\alpha_h$ and $\alpha_l$ in (2) are set to 0.2. The model is trained for 300 epochs with a batch size of 256 on an Nvidia Tesla V100 GPU, using version 1.9.0 of the PyTorch library.
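The coordinate augmentation can be sketched as follows. Note that the stated value of 0.02 is a variance, so the standard deviation passed to the sampler is its square root (about 0.141). The function name is hypothetical, and whether the shifted coordinates are clipped back to the (0, 1) range is not stated, so this sketch leaves them unclipped.

```python
import numpy as np

def jitter_keypoints(skeleton, variance=0.02, rng=None):
    """Add zero-mean Gaussian noise with the given variance to every coordinate."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance), size=skeleton.shape)
    return skeleton + noise

rng = np.random.default_rng(0)
skeleton = np.full((2, 33), 0.5)              # toy skeleton with all keypoints at the center
augmented = jitter_keypoints(skeleton, rng=rng)
```

Each call draws fresh noise, so the same training image yields a slightly different skeleton in every epoch.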

Experimental Results and Discussions
The first experiment evaluated the performance of the yoga pose grading approach, as shown in Table 2. As seen from this table, the proposed approach is able to achieve the best Recall and F1 performance in Dataset A. In the experiment using Dataset B, the proposed approach is able to achieve the best accuracy performance. The second experiment examined the use of coarse and fine contrastive examples, as shown in Table 3. As seen from this table, the proposed approach is able to achieve the best performance when both coarse contrastive examples and fine contrastive examples are used.
We acknowledge that the proposed approach is not superior to all baseline approaches on every individual performance metric. It is possible to improve the proposed approach in several aspects in future research works. First, more data augmentations can be applied to generate more contrastive pairs, which could further boost the model's performance in learning the discriminative features of different poses. Second, only the skeleton positions are used in the proposed approach; it would be interesting to incorporate other features, such as the geometrical features (e.g., angles or distances) among skeleton keypoints, into the proposed approach.
In addition, there are several interesting areas that warrant further research to address the limitations of the proposed approach. First, the proposed approach performs automated pose grading for a single image. In practice, yoga learners need to perform a complete cycle to exercise a certain pose. To address this, the proposed approach can be extended to perform yoga pose grading frame by frame. However, it would be interesting to study how such grading could be performed by considering temporal information provided by the learners' video instead of processing it frame by frame. Second, the proposed approach provides an overall grade for the yoga pose image. It would be interesting to study the quantitative evaluation of the learners' pose, such as arm angle or distance, so that further interpretable feedback could be provided to improve the motion of the human body in real time.

Conclusions
A computer vision-based yoga pose grading approach has been proposed in this paper. The proposed approach was able to automatically grade the yoga pose image via the learned contrastive skeleton feature representations. The proposed approach was able to produce more accurate pose grading, as verified in our experimental results with the use of two benchmark datasets.